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Abstract 

In this article, we propose a new method for the fundamental task of testing for depen¬ 
dence between two groups of variables. The response densities under the null hypothesis of 
independence and the alternative hypothesis of dependence are specified by nonparametric 
Bayesian models. Under the null hypothesis, the joint distribution is modeled by the product 
of two independent Dirichlet Process Mixture (DPM) priors; under the alternative, the full 
joint density is modeled by a multivariate DPM prior. The test is then based on the posterior 
probability of favoring the alternative hypothesis. The proposed test not only has good per¬ 
formance for testing linear dependence among other popular nonparametric tests, but is also 
preferred to other methods in testing many of the nonlinear dependencies we explored. In the 
analysis of gene expression data, we compare different methods for testing pairwise dependence 
between genes. The results show that the proposed test identifies some dependence structures 
that are not detected by other tests. 

Key WORDS: Test of independence, nonparametric Bayesian, Dirichlet process mixture, re¬ 
versible jump MCMC. 



1 Introduction 


A fundamental task in statistics is to determine whether two groups of variables are depen¬ 
dent. For example, in genomic analysis, we might want to test whether two groups of genes 
are associated to identify dependence between genetic pathways. In the brain imaging research, 
we may want to discover whether sets of voxels from different parts of the brain are related to 
explore functional connectivity. In general, high-dimensional data analysis can be simplified by 
identifying sets of independent variables. 

Testing of dependence is often reduced to testing for linear dependence. Pearson correlation 
coefficient is a classical and widely-used method for quantifying the strength of linear dependence 


between two univariate variables. Spearman’s rank correlation coefficient (Spearman 1904) is 
a ranked-based version of Pearson correlation coefficient which quantifies monotone correlation. 
Tests based on correlation are powerful for testing specific types of association, but lose power for 
other general types. 

For testing more general associations, the y 2 test of independence and Hoeffding’s test of 


independence (Hoeffding 1948) are two classical nonparametric methods. These tests are based 
on partitioning data into a contingency table. The main drawback for y 2 test is that the result is 
sensitive to the way the data are partitioned. Several approximations of the test statistics of the 


Hoeffding’s test are studied: Blum et al. (1961) introduce an approximation by the concordances 


and discordances of a 2 X 2 contingency tables, and Wilding & Mudholkar (2008) propose an 
approximation by using two Weibull extensions. A relation between the Hoeffding’s test and the 
X' 2 test statistics was noted by Thas &; Ottoy (2004), and they also suggested extending the idea 
of Blum et al. (1961) to a k x k contingency tables, for k > 2. More recent methods related to 


the Hoeffding’s test have been proposed by Heller et al. (2013) and Kaufman et al. (2013). Both 
of these tests are consistent under general types of associations. Other methods for testing for 
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(2007) can be extended to higher dimensions for testing joint independence of two or more random 


vectors. Several Bayesian methods are available for testing of independence. The simplest test 
of linear dependence between two univariate random variables can be achieved by fitting a linear 
model and inspecting the posterior distribution of the correlation coefficient. Other methods were 
proposed for testing of independence based on a contingency table (Nandram Sz Choi 2006, 2007 


Nandram et al. 2013). 


In this article, we propose a nonparametric Bayesian test of independence between two groups 
of variables. We test the null hypothesis of independence and the alternative hypothesis of depen¬ 
dence. We specify nonparametric Bayesian models for the response density under both hypotheses. 
Under the null hypothesis, the joint distribution is taken to be the product of two independent 
densities, both with nonparametric priors; under the alternative, the full joint density has a non¬ 
parametric prior. The test is based on the posterior probability of the alternative hypothesis. 
By specifying nonparametric Bayesian models under each hypothesis, we obtain an extremely 
flexible test which can capture both linear and complex nonlinear relationships between groups 
of variables. 

The remainder of the article proceeds as follows. In Section [2j we introduce the statistical 
algorithm. The details of the reversible jump MCMC algorithm use to compute the posterior 
probability of the alternative hypothesis are provided in Section [3| In Section [4j we present a 
simulation study to compare the power of the proposed test with other tests of linear and non¬ 
linear relationships. The method is illustrated using a genetic data analysis in Section [5] Section 
[H concludes. 


2 Statistical model 

Let Xi £ Wi Dl and X 2 £ R-° 2 be random vectors in D\ and D 2 dimensions, respectively, 
and denote X = ( X\,X 2 )• The objective is to test whether X\ and X 2 are independent. The 
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hypotheses are 


Ho : X\ and X 2 are independent and f(X) = /i(Wi)/ 2 (W 2 ) 

Hi : X 1 and X 2 are dependent and f{X ) cannot be factorized 

In other words, when they are independent, the joint density can be factorized as the product of 
two lower-dimensional densities. 

Under both hypotheses, the densities are modeled using Dirichlet process mixture (DPM) 
prior. Under Ho, fi(X\) and follow independent DPM priors; under Hi when Xi and 

W 2 are not independent, the joint distribution is assumed to follow a DPM prior. The following 
subsections describe the independent and joint DPM priors. 

2.1 The independent DPM prior 

When Xi and X 2 are independent, fj{Xj), j = 1,2, are assumed to follow the DPM prior 
independently. The DPM prior can be written as the infinite mixture 

OO 

fj(Xj) = ^2 w ij ( t > j(Xj | (1) 

1=1 

where wij is the mixture weight, (j>j is assigned to be the Dj —dimensional multivariate normal 
distribution (MVN) in this analysis, fj,^ is the mean vector of the I th mixture component, and Sj 
is the covariance matrix. 

The mixture weights are modeled by the stick-breaking construction with concentration 
parameter dj. The weights wij are modeled in terms of latent vij ~ Beta(l,dj). The first 
weight is w\j = v\j. The remaining elements are modeled as = % n!=i (1 v ij )> where 
nti(l — Vij) = 1 — Y2i=i w ij i s the remaining probability after accounting first l — 1 mixture 
weights. The number of mixture components is truncated by a sufficiently large number K (i.e. 
I = 1,..., K ), where the last term vk is fixed to be 1 to ensure that Xwli w ij = 1- 
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The mean vectors /x^ have priors /Xy ~ MVN(0. Clj ). The covariance matrices Xj and Clj 
are parameterized as X; = rSj, and Clj = (1 — r)Sj. Under this model, Sj is the covariance 
matrix for Xj marginally over the mixture means /x^-, and r is the proportion of the total variance 
attributed to the variance within each mixture component. The marginal covariance Sj is assigned 
to have inverse Wishart prior distribution, and to facilitate computing, the prior of r is a discrete 
uniform distribution with support r £ {0,0.01,..., 1}. The concentration parameter dj has prior 
distribution Gamma(a, b). 

2.2 The joint DPM prior 

When X\ and X 2 are not independent, f(X) is assumed to follow the joint DPM prior 

OO 

f{X) = Y J W l (j>{X | /*„£), ( 2 ) 

1=1 

where wi is the mixture weight, cj) is the (D\ + D 2 )—dimensional MVN distribution, /x^ is the mean 
vector of the I th mixture component, and X is the covariance matrix. The number of mixtures is 
truncated by the same number K as in the independent model. The mixture weights wi are again 
modeled by the stick-breaking algorithm with concentration parameter d. The mean vectors /x ; 
have priors /x^ ~ 0(0, O) . The covariance matrices 51 and Cl are modeled as S = rdiag(S^) and 
O = (1 — r)S, where S is the covariance matrix for X, and diag(S) is the diagonal form of S. In 
other words, under the joint DPM prior, we assign non-diagonal structure for the Cl, and diagonal 
structure for the 51. We found that diagonalizing S greatly improved computational stability . 
The priors for S, r, and d are the same as in the independent DPM prior. 

2.3 Bayesian test of independence 

The Bayesian hypothesis test of independence is based on the Bayes factor (BF) 

= P(H, | X)/P(Hp | X) = P(X 1 HQ 
P(Hi)/P(H 0 ) P(X | Ho)' 
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The null is rejected if BF > T, where T is a threshold parameter. The threshold parameter T 
can be chosen based on rules of thumb about the weight of evidence favoring Hi. For example, 


Kass & Raftery (1995) suggest that BF = 10 is a strong evidence for Hi. Alternatively, in the 


simulation study in Section [4j we select T to control the Type I error rate. In the analysis of 
genetic data in Section [5j multiple tests are performing simultaneously, therefore we select T to 
control the Bayesian false discovery rate. 


3 Computing details 

Computing the Bayes factor requires computing the posterior probability of each hypothesis. 
This is accomplished using a reversible jump MCMC (RJMCMC) algorithm as described below. 

3.1 Reparameterization and hyperparameters 

The updating algorithm of the DPM prior is facilitated by introducing the equivalent clustering 
model. The mixture form in ([Tj) can be written as 

fji^j | gj = l) = (j)j{Xj | HijjYlj'), 


which draws an auxiliary cluster label g 3 E {1, ...,1F} with P(gj = l) = wij. Similarly, the model 
in ([2]) is equivalent to 

f(X \g = l) = <f>(X | 

with cluster label g and P{g = l) = wi. Under the clustering model, the full conditionals of all 
the parameters are conjugate. 

In addition, we introduce model indicator parameter M, where 


M E 


/ 

J 


if X ] and X 2 are independent (Ho is true) 
if X 1 and X 2 are not independent (Hi is true). 
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Under each MCMC step, we propose a new indicator M' in the Markov chain, and decide whether 
to accept the new status M'. The probability P(Hi | X) is then approximated by Yli=i L(MW = 
J)/N, where N is the number of MCMC samples and AfW is the model status for the i th MCMC 
sample. 

Throughout this article, we let the number of mixture components truncated at K = 20 and 
the hyperparameters in the stick-breaking procedure (a, b) are fixed under different sample sizes 
n as presented in Table [I] 

Table 1: Hyperparameters (a, b) under different sample sizes n. 


n 

a 

b 

100 

1.5 

2.5 

200 

1.0 

4.0 

300 

1.0 

4.5 

500 

0.8 

4.6 


3.2 Pseudo code for the DPM test of independence algorithm 

Let @m denote the DPM parameters (&m = {l^n, • l l K 2 i U <f> 2 - wu ,..., wk 2 , ^ 1 ,^ 2 } if 
M = /, and ©m = {Pi, ...,Px> r i &■> w h d} if M = J). The algorithm of the DPM test 

of independence is described as follows: 

Step 0: Select initial values for M and @m- 

Step 1: Update ©m given M using the Gibbs sampling. 

Step 2: Update M given the parameters ©m- 

Step 2.1 : Generate proposed model status M' with P(M' = I) = P{M' = J) = 0.5. 

Step 2.2: If M = M'. then so back to Step 1. 

Step 2.3: If M = I and M' = J, then propose ® M > required for the joint DPM prior (Hi). 
Step 2.4: If M = J and M' = I , then propose ©required for the independent DPM prior 
(Ho)- 
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Step 2.5: Accept M' with probability min{l ,a(M,M')}. 

Step 3: Back to Step 1. 

The full conditionals requires for Step 1 are all standard and are given in Appendix [AT] for M = /, 
and Appendix |A.2| for M = J. Details on the RJMCMC steps are provided below in Section 3T 


3.3 Steps of the RJMCMC algorithm 


The parameter spaces under the independent and the joint DPM priors are different, so 
moving between these two parameter spaces becomes a trans-dimensional problem. Reversible 


jump MCMC (RJMCMC) was first introduced by Green (1995), which can be thought of as a 
generalized Metropolis-Hastings algorithm for the trans-dimensional updates. 

Under the current model status M, the propose model status M' is randomly assigned to be 
either / or J with acceptance probability min{l, a(M, M')}, where 


a(M, m’) 


hi 1 ' 71 m' ' Im | m' ' Pm' ->m 
Im ■ km ■ q M ' | M ■ Vm^m' 


(4) 


where Im and ttm are the likelihood function and the prior distribution under model M, q M * | M 
is the candidate distribution of the parameters when proposing for model M under model M, 
Pm^m' probability of proposing M’ conditional on the current status M, and |J| is the 

Jacobian. As M' is randomly picked from {/, J}, p M ^ M ’ and p M '^ M are equal in the algorithm. 
Note that when M = M , it becomes the usual fixed-dimensional MCMC algorithm as a(M, M ) = 
1; when M ^ M\ the candidate distribution of the parameters q is then for balancing the 
parameter spaces between the independent and joint models. 

Recall that @m and & M ' denote the DPM parameters under models M and M , respectively, 
and the truncated number K under both models are assigned to be identical. We first examine 
the case when X\ and X 2 are univariate random variables {D\ = D 2 = 1) with the current model 
status M = /, and the proposed model is M' = J. Denote the covariance matrix under the joint 
model as S = ( 5n g 1 2 1 2pj 5n ^ 2 22PJ ) • We assign the 2 x K mean vector /x to be the same in both the 
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independent and joint DPM models. Also, we assign the variances Sfj and Sf 2 , an d f to be the 
same across different model statuses. Therefore, this move only requires proposing the parameters 
under the joint DPM prior in ([2]): the cluster label g'j, p'j, the concentration parameter d'j, and 
the mixture weights w'j. The concentration parameter d'j is proposed by d'j ~ Gamma(d/, 1), 
where d/ is the mean of dj, and then the mixture probabilities w'j is proposed from the stick¬ 
breaking procedure with concentration parameter d'j. The cluster label g'j is proposed from the 
full conditional distribution given in Appendix [Aj The details of the mapping for each parameter 
is described in the end of this section. 

Conversely, if the current model status is M = J and the proposed model status is M' = I, the 
parameters of the independent model described in ([I]) are proposed as follow: The concentration 
parameter d'j ~ Gamma(dj,l), and the mixture weights w'j are again proposed by the stick¬ 
breaking procedure with concentration parameter d'j. The cluster label g'j is again proposed by 
the full conditional distribution given in Appendix |A| 

For dimension matching under the RJMCMC algorithm, the bijection map is described below 
for the case where M = I and M' = J. The reverse move uses the same map. Let 

Om = {/x, Sn, S22,D m/,dr,sr/} 
u = {p' j} w'j,d'j,g'j} 

0m> = {n,S 1 i 1 S 2 2 ,r,wj,dj 1 gj,pj} 

v! = {w'j, d'j, g'j}. 

Then we assign ©m = {9m, u}, ®m' = {Qm',u'}. The bijection function h has the form 

H®m) = h(0 M ,u) = ®m> = {8m',u'}, 

which is a one-to-one bijection map with: wi -> w}, dj —> d'j, gi —>• g'j, p'j -» pj, w'j —> wj, 
d'j —)> dj, and g'j —» g,j. Hence, the Jacobian |3| = 


9(6 M ,u) 


= 1. 





When D\ + Z) 2 > 2, the transition of the covariance matrices between the independent and 
joint models becomes more complicated as the off-diagonal elements are harder to propose than 
in the bivariate case. One way to alleviate this concern is to assume the covariance matrix S 
under the joint model is a block-diagonal matrix S = (^ g ), where S t is a Dj x Di covariance 
matrix of X,- L for i = 1,2. However, in the simulation study and the real data analysis of this 
article, we will focus on the case where X\ and X 2 are univariate random variables. 


4 Simulation Study 

The simulation study focuses on testing for dependence between two univariate variables. The 
objective is to compare the power of each method under linear and nonlinear dependence. In the 
following subsections, we introduce the data generation procedure, the competing methods, and 
the simulation results. 


4.1 Data generation 


The seven different types of data sets are simulated. Scenarios 5 and 6 are designed from 


Kaufman et al. (2013). 


1. Independent normal (Null): Xj ~ N(0,1), for j=1,2. 

2. Bivariate normal (BVN): (X\, X 2 ) ~ BVN 0, (*) , where p = 0.2. 

3. Horseshoe (HS): X\ ~ N(0,1), X 2 \ X\ ~ N(/uY 2 , 1), where p = 0.2. 

4. Cone: X, ~ U(0,1), X 2 | X 1 ~ N [0, ( P X\ + 0.1) 2 ], where p = 0.1. 

5. W: Xi ~ ± £?=! U(oi, ai + |), X 2 | W ~ U [3(Xf - , 3(1 + Xf — |)], where fll = -1, 

n is the number of samples, and a, = aj_i + -, for i > 1. 


6. Circle: (X U X 2 ) ~ ^ ^”= 1 BVN 
defined as in W. 




, where 9i = [sin(oj7r), cos(aj7r)], and ai is 
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Each scenario is generated with the algorithms introduced above with sample size n = 100, 200, 
and 500. Then for each dimension, we standardize the data to have mean zero and variance 
one. We plot the data when n = 200 in Figure [l] along with the true density. The responses 
are dependent for designs 2-6. Design 3-6 are all examples of the challenging dependent but 
uncorrelated random variables and thus the usual test of correlation will miss this dependence. 

Null BVN HS 



Figure 1: True log density (background color) and one simulated data set (points) for each simulation 
design. 
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4.2 Methods for testing of independence 

We compare six methods in the simulation study (described in detail in the Appendix). Each 
method is controlled to have type I error rate approximately equal to 0.05. 


1. Linear regression (LR): The model X-i = /?o + PiX\ + e, e ~ N(0,1) is fitted by least squares 
and the linear association is determined by the test of = 0. 


2. E-statistics (ES) (Szekely et al. 2007): The testing procedure is by calculating the distance 
covariance between X\ and X 2 . 


3. Heller-Heller-Gorfine method (HHG) (Heller et al. 2013): The test statistic is based on the 
sum of all likelihood ratio tests of 2 x 2 contingency tables formed by the pairwise distances 
within each of X\ and X^. 


4. Data Derived Partitions method (DDP) (Kaufman et al. 2013) with 3x3 contingency 


tables: The DDP method is similar to the HHG method, but only designed for univariate 
random variables. The test statistic is based on the sum of all likelihood ratio tests of 3 x 3 
contingency tables formed by the observed values. 


5. Maximal Information Coefficient method (MIC) (Reshef et ah|2011): It is a rank-order test 
statistic which is calculated from the largest achievable mutual information under different 
grid sizes. 


6. The DPM test of independence (DPM): The proposed test is described in Section [2j X is 
first marginally transformed to be standard normal distribution. The normal score trans¬ 
formation makes the proposed method a distribution-free testing procedure. Therefore, the 
threshold for the BF in Section [2] that controls Type I error can be determined by the per¬ 
mutations of the transformed data. The threshold T for the Bayes factor is computed from 
300 permutations of the sample. 
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4.3 Simulation results 


The results are presented in Table [2] with sample sizes n = 100, 200 and 500. The first three 
rows of the table are the type I error rate for each method under different samples sizes, which 
is controlled for all methods (Type I error rate is between 0.03 to 0.09). The following rows give 
the power of each method under different scenarios and sample sizes. It is clear that as the sample 
size n increases, the powers increase for all the methods except the LR method under the HS, 
Cone, and Circle scenarios because of the nonlinear associations of these scenarios. 

When the data are generated from bivariate normal distribution, the LR method has the 
highest power. This is expected because the LR method is theoretically the most powerful test 
under this scenario. The ES and DPM tests are the second best among other comparing tests. 

The DPM test outperforms all other methods when data are generated from the HS and the 
W shapes. Under the Cone shape data, the HHG and the DPM tests both perform well. For the 
Circle design, the HHG, DDP, and DPM tests all have power greater than 0.9 starting from small 
sample sizes, and the ES and MIC have lower power. 

In summary, the LR method is able to capture linear association but loses power in the 
nonlinear cases. The ES method is able to capture linear and nonlinear associations, but loses 
power in some of the nonlinear cases. The HHG and DDP methods both have high power in 
testing of nonlinear associations, but lose power in the linear association, especially the HHG 
method. The MIC method is a relatively conservative test compared to all other methods, and 


this problem is discussed by Heller et al. (2012). The proposed method not only shows the ability 
to capture the linear association, but is also powerful for detecting nonlinear associations in the 
simulation study. 
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Table 2: Power of each test (columns) for each simulation settings and sample size n (rows). A * indicates 
that the power is significantly different than the power of DPM test. 


Type 

n 

LR 

ES 

HHG 

DDP 

MIC 

DPM 

Null 

100 

0.06 

0.04 

0.03 

0.02 

0.03 

0.03 


200 

0.06 

0.05 

0.03 

0.02 

0.07 

0.05 


500 

0.09 

0.08 

0.05 

0.09 

0.02 

0.09 

BVN 

100 

0.57* 

0.49 

0.24* 

0.38* 

0.13* 

0.43 


200 

0.84* 

0.78 

0.37* 

0.61* 

0.25* 

0.76 


500 

0.99 

0.99 

0.83* 

0.99 

0.40* 

0.99 

HS 

100 

0.11* 

0.22* 

0.42 

0.39 

0.12* 

0.44 


200 

0.05* 

0.48* 

0.53* 

0.60* 

0.25* 

0.68 


500 

0.10* 

0.96 

0.99 

1.00 

0.42* 

1.00 

Cone 

100 

0.03* 

0.25* 

0.54* 

0.33 

0.17* 

0.36 


200 

0.08* 

0.56* 

0.87 

0.73* 

0.37* 

0.84 


500 

0.09* 

1.00 

1.00 

1.00 

0.80* 

1.00 

W 

100 

0.54* 

0.42* 

0.55* 

0.75* 

0.34* 

0.92 


200 

0.84* 

0.83* 

0.92* 

1.00 

0.70* 

1.00 


500 

1.00 

1.00 

1.00 

1.00 

0.98 

1.00 

Circle 

100 

0.00* 

0.00* 

0.96 

0.99 

0.20* 

0.99 


200 

0.00* 

0.22* 

1.00 

1.00 

0.37* 

1.00 


500 

0.00* 

1.00 

1.00 

1.00 

0.95* 

1.00 


5 Real data analysis 

We compare the six methods in the simulation study on the gene expression data set from 


Hughes et al. 

(2000 

). Studies of associations between genes can be found in 

de la Fuente et al. 


(2004) and Bhardwaj & Lu (2005). The number of observations is n = 300 for each gene, and we 


select 94 genes on chromosome 1 after removing samples with missing values. The objective is to 
test the pairwise associations within these 94 genes. A total of ( 9 2 4 ) = 4371 hypotheses tests of 
independence are performed. Because of the large number of tests, we control false discovery rate 
(FDR) at the 0.05 level rather than Type I error. The Bayesian FDR (BFDR) control procedure 
is applied ( Efron &; Tibshirani|2002 Newton et al.||2004 Storey et al.|2004 , |Muller et al.||2006 ) for 


the DPM test, and the Benjamini-Hochberg procedure (Benjamini &; Hochberg 1995) is applied 
for the other methods. 
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The Cohen’s k statistic (Cohen 1960) is used to measure agreement between tests. The k 
statistic is 


k = 


Pa-Pe 
1 - P, ' 


where P a is the proportion of agreements between the two methods among the N = 4371 tests, and 
P e is the theoretical proportion of agreements under independence. Larger values of k represents 
more agreement between the tests. The number of rejections among N = 4371 tests and the k 
statistics of pairwise methods are presented in Table [3j 

Table 3: Numbers of rejections (of the N = 4371 tests), and Cohen’s k statistics for each pair of methods. 


Methods 

Number of rejections 

LR 

ES 

HHG 

DDP 

MIC 

DPM 

LR 

2404 

1.000 

0.472 

0.301 

0.404 

0.082 

0.452 

ES 

3352 

- 

1.000 

0.686 

0.830 

0.036 

0.779 

HHG 

3442 

- 

- 

1.000 

0.751 

0.032 

0.720 

DDP 

3350 

- 

- 

- 

1.000 

0.036 

0.814 

MIC 

249 

- 

- 

- 

- 

1.000 

0.042 

DPM 

3231 

- 

- 

- 

- 

- 

1.000 


The At statistics show that the ES, HHG, DDP, and the DPM tests have similar testing powers in 
this gene expression data sets, and the number of rejections among these tests are similar (3231 
to 3442). The LR test only captures the linear associations between genes, and the MIC has the 
lowest power as in the simulation study. 

In Figure [2] we plot six pairs of genes where there are disagreements among the tests. In the 
upper two plots (gene 94 versus gene 8, and gene 88 versus gene 15), the associations between 
these pairs of genes are detected by the DPM test, but not the other tests. The figure shows 
that between gene 94 and gene 8, there is a horseshoe pattern of dependence, and a nonlinear 
relationship between gene 88 and gene 15. In the middle two plots (gene 17 versus gene 1, and 
gene 89 versus gene 24), the ES, HHG, DDP, and DPM tests all flag associations between genes, 
but not the LR and the MIC tests. The figure shows that gene 17 and gene 1 have a cone-shape 
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association, and genes 89 and 24 have a clustering relationship. The bottom two plots (gene 92 
versus gene 2 and gene 30 versus gene 6) are the cases where only the LR, ES and DPM tests flag 
associations between genes. These three tests are powerful in testing the linear associations, and 
the figure shows linear relationships between genes in these two pairs. 


94 vs 8 


88 vs 15 



i i i i i 
- 1.0 - 0.6 - 0.2 


Gene8 

17 vs 1 



Gene15 

89 vs 24 



Genel 


Gene24 



Gene2 


Gene6 


Figure 2: Six pairs of genes where there are disagreements among the tests. The red lines are the linear 
regression fitted lines. 


6 Conclusion 

We propose a nonparametric Bayesian test of dependence by calculating the Bayes factor using 
the Dirichlet process mixture model and the reversible jump MCMC algorithm. We compare 
our method with the linear model, distance correlation method, HHG, DDP, and MIC in the 
simulation study and also in the gene expression data sets. The simulation results show that the 
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proposed test is competitive in testing both linear and nonlinear relationships. 

In the gene expression data analysis, we performed 4371 multiple testing on the gene expression 
data in comparing pairwise genes. The proposed test shows similar performance with the distance 
correlation, DDP, and HHG methods, and detects some cases that other methods do not detect. 
It also shows that the proposed method is powerful on both linear and nonlinear relationships in 
the pairwise gene comparisons. 
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A Full conditional distributions 

A.l Full conditionals for the independent DPM prior 

Let X j = {Xij : i = 1, where X V] is the i th observation of Xj and N is the number 

of observations. The prior of Sj is Sj ~ IW Dj(pj, Wj). The full conditional distribution for each 
parameters under the independent DPM prior of Xj are 


Plj 

rest 

S 3 

rest 

£ 

II 

5^ 

rest) 

H 

3 

rest) 


MVN^.Kn^- 1 + nj'r'vjH E Xij), K-s- 1 + nj 1 )- 1 } 

i'9ij 


IW [N + K + Pj ,A] 


Uik °A-^u 


A 


9ij3 


> r mSj) n«=l I 0,(1 r m. ) Sj ) 


E=l [rii=i I 

tfijiXij | Hiji ^j) w ij 


Sj) n£i I M 1 - r q )Sj 


j jlV s j 


_j_ 

n r 


vij | rest ~ Beta 
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where l = m = 1 ,...,n r , % = l,-, AT, A = r ; V; Y , i X ;.y - /j, g .. j )(X ij - H gijj ) T + (1 - 

r) _1 Hijtfj + PjtFj, r Hj = = 0; 9ij is the cluster label of the observation, n r 

is the number of discrete r values, <pj is the Dj -dimensional multivariate normal density function, 
and (a, b ) is the tunning parameter of the stick-breaking algorithm. 


A.2 Full conditionals for the joint DPM prior 

Let X = {Xj : i = 1,..., A}, where X, is the i th observation of X and N is the number of 
observations. The prior of S is S ~ IWd 1 + d 2 (p, W). The full conditional distribution for each 
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parameters under the joint DPM prior of X are 
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jjVJsj 

N 


^ Tfl* — 0 + 1) ^ -^G?* > l) + d 
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Gamma it' + a — 1, b — log(l — v;) 


J=i 


where l = 1 m = 1 ,...,n r , i = 1 B = r 1 Eili ( x i ~ V gi )( X '< ~ V gi ) T + (1 - 

r) _1 Ea=i A iilif + pW, n/ = EiLiHSi = 0) 9* is the cluster label of the i th observation, n r is 
the number of discrete r values, (j) is the D-dimensional multivariate normal density function, and 
(a, b) is the tunning parameter of the stick-breaking algorithm. 


B Test of independence by E-statistics 


The test of independence by E-statistics, which calculates the distance covariance measures 


(dCov), was first introduced by Szekely et al. (2007). The dCov between two random variables 
(or vectors) X \ £ M p and £ l 9 with finite first moments is the nonnegative number defined 


as 


V 2 (X 1 ,X 2 ) = II f{X) - f 1 (X 1 )h(X 2 


(5) 


where 


is the L 2 -norm with weight function w. The w is described more details in Szekely 


et al. (2007), and in this article, we use the identical w as suggested. The empirical distance 
covariance of n observed samples V^{X\, X 2 ) is also defined in Szekely et al. (2007). A test 


18 




















statistic T(Xi, X 2 ,p, n) that rejects the null hypothesis that two random variables (or vectors) 


if 


nV^(X 1 ,X 2 

S 2 




has an asymptotic significance level at most a, and 


n 

S-2 = —y £ \X\k - Xn\p 7 

n z z —' w 


k,l =1 


9 7 \^2k — X-2l\q, 

n, 

k,l =1 


where | • | r is the L r -norm, and is the standard normal distribution. However, the test decision 
based on is quite conservative for many distributions, so the testing decision in this article is 
determined by 300 permutation samples under the null hypothesis with Type I error rate p = 0.05 
level under each data set. The R package ” energy ” with function ” indep.test” is used in the anal¬ 
ysis. 


A distribution-free version of distance covariance was also introduced in (Szekely & Rizzo 


2009), which uses the ranks of the observations instead of the values. In this article, the 


distribution-free version of dCov performs similar to the original version, so we only present 
the original version of the dCov. 

C Heller-Heller-Gorfine test of association based on Euclidean 
distance metric 


This test was first introduced by Heller et al. (2013). The test is based on the pairwise 


distances within X\ and X2 respectively. Let the pairwise distances within Xj. j = 1 , 2, denoted 
as {d(Xij, Xjj) : i. i e (1, ...,n}}, where Xij is the i th observation in Xj , and d(-, •) is assigned 
to be the Euclidean distance metric in this article. The idea is to first randomly select two samples 
i and 1 in each of X\ and X 2 , and then use the distances d(Xn,X i > 1 ) and d(Xi2,X j 2 ) as the 
references to construct a 2 x 2 contingency table among the remaining n — 2 samples. Then 
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the likelihood ratio test of independence for summarizing this table denoted as S(i,i ) gives test 
statistic 

n n 

T = E E ®(m)- 

The 0.05 Type I error rate is controlled by 300 permutation samples under the null hypothesis of 
each data set. The R package ”HHG” with function ”hhg.test” is used in the analysis. 


A distribution-free version of HHG test was suggested in (Heller et al. 2013) for comparison 


We found that in this article the results are similar to the original version of HHG test. Therefore, 
we only present the original version of the HHG results. 

D Distribution-free tests of association based on data derived 
partitions 


This test was first introduced in (Kaufman et al. 2013), which is designed for testing two 


univariate random variables (i.e. D\ = D 2 = 1). The idea follows the HHG test but with 
different ways of forming the contingency tables. The data values are now used directly instead of 
using the distances. In forming a 2 x 2 contingency table, one sample point is randomly selected 
as the reference, and then a 2 x 2 contingency table can be constructed and a test statistic of this 
table is calculated. The same procedure can be applied to form m x m contingency tables (m > 2) 
with randomly selected m — 1 data values as references. More specifically, the mxm contingency 
table is defined by the range (— 00 , X*^), (X*^ 2 p X*^),..., (X*^^, 00 ) hiXi, and (—< 00 , X*^), 
(X^),^2(3)),..., pr* (m — i),°c) in X2, where X*^ is the r th ordered selected observation in Xj, 
j = 1,2. In this article, the summation of the likelihood ratio test statistics with each 3x3 
(m = 3) contingency table is used as the test statistics. This setting was shown to perform the 


best in most of the scenarios in (Kaufman et al. 2013). The testing decision is again based on 


300 permutation samples under the null hypothesis for each data sets controlled under 0.05 Type 
I error rate in this study. The R package ” HHG” with function ” xdp.test” is used in this article. 
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E Maximal information coefficient for measuring dependence of 


two variables 


The Maximal Information Coefficient (MIC) method is first introduced by 
The intuition is that if a relationship exists between two univariate random variables, then a grid 
(a square) can be drawn on the scatter-plot of these two variables which can partition the data to 
capture the relationship. The method explores all size of grids up to a maximal grid resolution. 
For grid size x-by -y, the largest achievable normalized mutual information (MI) is denoted as m xy 


Reshef et al. (2011 


m. xy = max{J X y}/log(min{x, y}), 


where computation of I xy can be found in (Jiang et al. 2010), and the MIC is the maximum of 


m xy over all pair (x,y) such that xy < B, where B depends on the sample size n. In this article, 
we use B = n 0 ' 6 as suggested in Reshef et al. (2011), and the p-value is calculated from the p-value 


table given in www.exploredata.net/Downloads/P — Value —Tables. The R package ” minerva 1 ' 
with function ’’mine” is used in this article. 
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